INTRODUCTION

A. The application of near-infrared reflectance analysis (NIRA) as an
analytical technique has been concentrated mainly in the agricultural area
where it originated. These agricultural applications are characterized
by the need to determine a limited number of constituents in a very large
number of individual but similar samples. In contrast, the samples
encountered in most industrial analytical laboratories vary widely in
kind, and the number of very similar samples is limited.

If NIRA is to be applied broadly to industrial chemical analysis, it
must be modified to sharply reduce the developmental effort needed to set up
an individual method. At present, the establishment of a NIRA procedure
requires the assembly of a fairly large set of standard samples whose
composition has already been established by a reference method or methods. The
reflectance of the samples must then be measured at a substantial number
of points in the near-infrared spectral region, and the resulting data
subjected to a multilinear regression algorithm. This algorithm then generates
the choice of analytical wavelengths and yields a "correlation equation"
which relates concentration of desired constituents to reflectance at various
near-infrared wavelengths. This latter calculation step is very demanding
of computer hardware and processing time, particularly for high-performance
NIRA instruments that cover a wide range of wavelengths, a probable requirement
for industrial analysis. In fact, these computational requirements are so
demanding that they have often forced shortcuts in methods development and
optimization, and have limited the performance of the NIRA system. This
limitation is manifested by incomplete optimization of the analytical
wavelength set, whereby either too few wavelengths are examined, abridged
wavelength selection criteria are used, or too few or incorrect wavelengths
are selected for the NIRA procedure.

This limitation also encourages incomplete testing and evaluation of
the developed NIRA method, which leads to the widespread use of poorly
understood algorithms and slows the development of improved ones. This lack of
understanding and the workload of existing methods have often been great
enough to discourage people from examining the NIRA approach, and have acted
as a brake on its wider acceptance and application.

In the present paper, an algorithm is described and evaluated for
substantially accelerating the wavelength and calibration coefficient selection
process of NIRA. This algorithm is used to find "correlation equations"
for protein in wheat and benzene in a hydrocarbon mixture. Bias-corrected
standard errors of prediction obtained with the new algorithm reached 0.26
percent protein in wheat and 1.01 percent benzene by volume. Comparisons
of the algorithm with several others based on regression show improvements in
computation time ranging from a few percent to as much as 200-fold. It is
also discussed how the novel method might prove advantageous in the reduction
of overfitting and in the improvement of NIRA accuracy.

B. Calibration Procedures in NIRA. The general pattern for establishing
a NIRA calibration is described in a review article by Watson, and will be
briefly summarized here for clarity. The first step in establishing a NIRA
calibration is to obtain a sample set in which the desired characteristic or
sample constituent has been previously determined by a reference chemical or
spectroscopic technique. An example of such a set would be wheat samples whose
protein content had been established by Kjeldahl determinations. The sample
set is randomly divided into two subsets, one for solving the regression
procedure (training) and one for testing the regression (prediction). Next,
the near-infrared diffuse-reflectance spectrum for each sample is obtained.
The spectra from the training set are then analyzed by some form of multiple
linear regression. Typically, -log reflectance values (R) are regressed
against the chemically determined concentrations to identify a group of
wavelengths at which R best predicts the desired constituent in the training
set. A number of alternative linear regression techniques are currently
available to establish the NIRA calibration. These techniques include (but
are not limited to) stepwise, all possible combinations, all possible pairs
stepwise, and all possible triplets stepwise.

Stepwise regression is well known in statistical applications.6 In
its most general form a stepwise regression algorithm calculates the linear
regression between two sets of variables and establishes a statistical
confidence level to their degree of coherence. It then adds new values to or
deletes old ones from one of the sets in an attempt to improve the coherence;
coherence is usually expressed in terms of a correlation coefficient.
Procedures involving the addition or deletion of values are called forward
stepwise and backward stepwise regression, respectively. In its application
to NIRA, forward stepwise regression involves the addition of R values at
new wavelengths and suffers from the problem that the newly added wavelength
is often not the best wavelength to add. Moreover, background interferences
can cause omission of an important wavelength. Backward stepwise regression
in NIRA requires that the total number of wavelengths that are employed be
small enough that the regression containing all wavelengths can be calculated
in a reasonable time, as the starting point for the backwards stepping. This
requirement is usually inconsistent with the amount of data generated by a
spectral scanning instrument.
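As an illustration of the forward stepwise idea described above, the following is a minimal sketch in Python with NumPy, not the procedure of any particular NIRA package. The function name `forward_stepwise`, the use of the residual sum of squares as the selection criterion, and the inclusion of an intercept column are assumptions made for the example.

```python
import numpy as np

def forward_stepwise(R, c, n_keep):
    """Greedy forward selection of wavelengths (columns of R) to predict c.

    R: (samples x wavelengths) array of -log(reflectance) values.
    c: reference concentrations, one per sample.
    Returns the chosen column indices and the residual sum of squares
    after each addition. (Hypothetical helper, for illustration only.)
    """
    n_samples, n_wl = R.shape
    chosen, history = [], []
    for _ in range(n_keep):
        best_col, best_sse = None, np.inf
        for col in range(n_wl):
            if col in chosen:
                continue
            # Design matrix: intercept plus the trial set of wavelengths.
            X = np.column_stack([np.ones(n_samples), R[:, chosen + [col]]])
            coef, *_ = np.linalg.lstsq(X, c, rcond=None)
            sse = float(np.sum((c - X @ coef) ** 2))
            if sse < best_sse:
                best_col, best_sse = col, sse
        # The wavelength giving the smallest residual is added permanently.
        chosen.append(best_col)
        history.append(best_sse)
    return chosen, history
```

Because each addition is judged only against the wavelengths already chosen, an early choice is never revisited, which is the weakness of the forward approach noted in the text.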


The "all-possible-combinations" regression improves upon the forward
stepwise approach, in that background interferences do not bias the selection
of wavelengths. The drawback to the all-possible-combinations technique
is the enormous number of calculations it requires. This number is equal
to $2^m$, where m is the total number of wavelengths being employed,
restricting this approach to those applications with the smallest data sets.

In an effort to combine the advantages of the stepwise and
all-possible-combinations methods, several hybrid techniques such as
"all-possible-pairs stepwise" and "all-possible-triplets stepwise" have been
developed.7 These techniques begin with all possible pairs or triplets of
wavelengths, respectively, and proceed by means of a forward stepwise
regression. In this way, the best pair or triplet of wavelengths cannot be
hidden by background interferences, yet the number of required calculations is
much less than in the all-possible-combinations method. To ensure
self-consistency, one of the wavelengths earlier adopted in the calibration is
dropped and the best wavelength to add is then determined by stepwise
regression. If the calibration is self-consistent, this new wavelength is the
same as the one just deleted. If not, the new wavelength is retained, a
different one deleted, and the process repeated until the wavelength which is
deleted is subsequently restored by the regression process.

After using any of these regression techniques one obtains a calibration
of the form:

$$C = B_0 + B_1 R_1 + B_2 R_2 + \cdots + B_j R_j \tag{1}$$

where $B_0, \ldots, B_j$ are the coefficients of intercept and partial slopes
from the regression equation, $R_j$ is -log (reflectance) of the sample at the
jth wavelength and C is the concentration of the desired species in the sample.

Once the $B_0$ through $B_j$ coefficients are determined, the standard deviation
between the actual and predicted concentrations for the training set (corrected
for the statistical degrees of freedom) is computed and called the "standard
error of estimation" (SEE). The mathematical definition of SEE is given in
Eq. (2).

$$\mathrm{SEE} = \left[ (N_s - 1 - N_w)^{-1} \sum_{i=1}^{N_s} e_i^2 \right]^{1/2} \tag{2}$$

where $N_s$ is the number of samples in the training set, $N_w$ is the number of
wavelengths kept and $e_i$ is the difference between the true component
concentration and the value predicted by Eq. 1 for the ith sample.

Next, the deduced regression equation (Eq. 1) is used to calculate the
concentration of the desired constituent in each of the samples in the
prediction set. From these computed concentrations and those known from the
earlier independent chemical analysis (e.g. Kjeldahl determination), another
standard deviation is determined, termed the "standard error of prediction"
(SEP). The definition of SEP is given in Eq. (3).

$$\mathrm{SEP} = \left[ (N_s' - 1)^{-1} \sum_{i=1}^{N_s'} e_i^2 \right]^{1/2} \tag{3}$$

where $N_s'$ is the number of samples in the prediction set.

The value of SEP is typically used as a measure of the performance of
Eq. 1; however, a bias-corrected SEP better estimates how well the calibration
will perform in the field, where routine comparisons between NIRA results
and results from the reference chemical method are periodically used to
adjust the long-term drift of the NIRA spectrophotometer. This bias-corrected
SEP is given by the equation:

$$\mathrm{SEP\ (bias\text{-}corrected)} = \left[ (N_s' - 1)^{-1} \sum_{i=1}^{N_s'} (e_i - \mathrm{Bias})^2 \right]^{1/2} \tag{4}$$

where

$$\mathrm{Bias} = (N_s')^{-1} \sum_{i=1}^{N_s'} e_i \tag{5}$$
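The error statistics defined in Eqs. (2) through (5) can be sketched in a few lines of Python. The function names below are illustrative choices, not taken from the original work, and the residual is assumed to be the reference value minus the predicted value.

```python
import math

def residuals(actual, predicted):
    """e_i: reference concentration minus the value predicted by Eq. (1)."""
    return [a - p for a, p in zip(actual, predicted)]

def see(actual, predicted, n_wavelengths):
    """Standard error of estimation, Eq. (2), for the training set."""
    e = residuals(actual, predicted)
    return math.sqrt(sum(x * x for x in e) / (len(e) - 1 - n_wavelengths))

def sep(actual, predicted):
    """Standard error of prediction, Eq. (3), for the prediction set."""
    e = residuals(actual, predicted)
    return math.sqrt(sum(x * x for x in e) / (len(e) - 1))

def bias(actual, predicted):
    """Mean residual over the prediction set, Eq. (5)."""
    e = residuals(actual, predicted)
    return sum(e) / len(e)

def sep_bias_corrected(actual, predicted):
    """Bias-corrected SEP, Eq. (4)."""
    e = residuals(actual, predicted)
    b = sum(e) / len(e)
    return math.sqrt(sum((x - b) ** 2 for x in e) / (len(e) - 1))
```

For example, with reference values 1, 2, 3, 4, 5 and predictions 1.1, 1.9, 3.2, 3.8, 5.1, the bias is -0.02 and the bias-corrected SEP is slightly smaller than the uncorrected SEP.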

C. Row-reduction. Because NIRA is similar to multi-component uv-visible
spectrophotometry, it would be very useful to transfer the knowledge and
technology of this latter field to NIRA. Unfortunately, this transfer is
not straightforward. A NIRA spectrum contains virtually no peaks attributable
to a single species, so individual "absorptivities" cannot be measured and
background corrections are very complex. In fact, it was this very complexity
that led to the introduction of regression techniques. Unfortunately,
multilinear regression techniques are very easily overfitted and can be
very slow.

In an attempt to reduce the overfitting of multilinear regression and
shorten computation time, a simplifying assumption has been made in the
present study. Specifically, if the errors in the reference chemical method
and in the measured diffuse reflectance spectrum are small, a simple
linear-algebra solution of j unknowns with j equations will give a good first
approximation to a multilinear regression. To test the assumption, the
Gauss-Jordan reduction method8 for treating linear equations was used to
solve Eq. (1) for several NIRA sample sets. The authors have elected to
call this particular application of Gauss-Jordan reduction "row-reduction".

I.  THEORY 


Gauss-Jordan reduction is a general approach to solving a system of n
equations for n unknowns. A full description of the mathematics involved
can be found in Reference 8. Briefly, to solve for a single variable in
a system of equations such as the one shown in Eq. (6) [which can be
rewritten in matrix form as Eq. (7)], each equation can be multiplied by some
constant and then subtracted from another equation. For example, to solve
Eqs. (6) and (7) for the variable x, the second row can be multiplied by
-1/2 and the third row multiplied by -3/2. These operations transform Eq.
(7) into Eq. (8).

$$\begin{aligned} 3x + 2y + 3z &= 16 \\ 6x + 2y + 8z &= 28 \\ 2x + 6y + 4z &= 26 \end{aligned} \tag{6}$$

$$\left[\begin{array}{ccc|c} 3 & 2 & 3 & 16 \\ 6 & 2 & 8 & 28 \\ 2 & 6 & 4 & 26 \end{array}\right] \tag{7}$$

$$\left[\begin{array}{ccc|c} 3 & 2 & 3 & 16 \\ -3 & -1 & -4 & -14 \\ -3 & -9 & -6 & -39 \end{array}\right] \tag{8}$$


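When row 1 of Eq. (8) is added to rows 2 and 3, the resulting matrix is
shown in Eq. (9).

$$\left[\begin{array}{ccc|c} 3 & 2 & 3 & 16 \\ 0 & 1 & -1 & 2 \\ 0 & -7 & -3 & -23 \end{array}\right] \tag{9}$$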


This process can be continued to solve for y and z. When the matrix is
entirely solved, it is in the form shown in Eq. (10), where I is the identity
matrix and $A_p$ is the answer to the pth variable in the equation.

$$\left[\begin{array}{c|c} I & A_p \end{array}\right] = \left[\begin{array}{ccc|c} 1 & 0 & 0 & A_1 \\ 0 & 1 & 0 & A_2 \\ 0 & 0 & 1 & A_3 \end{array}\right] \tag{10}$$

For Eq. (6), x is the first variable, y is the second and z is the third, and
their solutions are found in $A_1$, $A_2$ and $A_3$, respectively.
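The reduction of Eq. (6) can be checked numerically. The sketch below is a generic Gauss-Jordan routine written for this illustration (the name `gauss_jordan` is an assumption); unlike the hand calculation above, it normalizes each pivot row and interchanges rows so that the largest available entry sits on the diagonal.

```python
import numpy as np

def gauss_jordan(augmented):
    """Reduce an n x (n+1) augmented matrix to the form [I | A]."""
    M = np.array(augmented, dtype=float)
    n = M.shape[0]
    for col in range(n):
        # Row interchange: bring the largest remaining entry to the diagonal.
        pivot = col + int(np.argmax(np.abs(M[col:, col])))
        M[[col, pivot]] = M[[pivot, col]]
        # Scale the pivot row, then clear this column in every other row.
        M[col] /= M[col, col]
        for row in range(n):
            if row != col:
                M[row] -= M[row, col] * M[col]
    return M

# Augmented matrix of Eq. (6): coefficients of x, y, z and the right-hand side.
eq6 = [[3, 2, 3, 16],
       [6, 2, 8, 28],
       [2, 6, 4, 26]]
solved = gauss_jordan(eq6)
x, y, z = solved[:, 3]
```

The last column of `solved` holds the answers for x, y and z, which work out to 2.5, 2.9 and 0.9; substituting these back into Eq. (6) reproduces 16, 28 and 26.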

The adaptation of Gauss-Jordan reduction to row-reduction NIRA
is straightforward. The calibration of a NIRA sample set proceeds through
the collection of spectra as described earlier. After the diffuse reflectance
spectrum of each sample is obtained, the first j [where j is the number of
terms in Eq. (1)] reflectance (R) values in the spectrum for each sample
in the training set are placed in a matrix. It is important to recognize
that the j reflectance values used in this matrix do not constitute the
entire sample spectrum. Rather, they are merely the first j values of the
entire spectrum. The standard concentration of the sought-for species in
each sample of the training set is also placed in the matrix; these
concentrations correspond to the known values (right-hand side) of Eq. (6) and
are termed the augmented portion of the matrix. The resulting matrix is
shown pictorially in Fig. 1. Figure 1 is just the matrix form of the set
of equations (like Eq. 1) resulting from several samples. The unknowns on
the left-hand side of the matrix illustrated in Fig. 1 correspond to the
$B_1$ through $B_j$ values of Eq. (1).

To solve the matrix in Fig. 1 via Gauss-Jordan reduction, the matrix
rows are rearranged to cause the largest reflectance (R) value in each
column to lie on the diagonal. This rearrangement is called row interchanging.
The matrix in Fig. 1 is then solved successively for each $B_n$ term as described
earlier; as a consequence, the remaining R values become orthogonal to those
which were used to solve for the B terms. This behavior can be seen in Eq. (9);
the first row of Eq. (9) is the only one which contains information about the
unknown value x. When the matrix is completely solved and reduced to the form
of Eq. (10), the first row is orthogonal to the rest of the matrix and
contains information only about x. By means of row interchanging, the most
mutually orthogonal samples are chosen to determine the $B_n$ terms.

After the $B_n$ values have been found, the solution is validated by
comparing actual vs. predicted values for the training set and calculating a
SEE and a correlation coefficient (r value). This r value is saved for
comparison with later solutions.

Once the r value for the first matrix has been computed, the column
corresponding to the wavelength with the largest B multiplier is dropped
from the matrix and the R values for the next, (j + 1)th, wavelength are put
in its place. The computation and matrix solution are then repeated. After
every wavelength that was recorded in the original spectrum has proceeded
through this computation, the entire process is repeated, using the final
matrix as a starting point. All wavelengths are again stepped through the
matrix solution procedure; after this second iteration, the combination
which gave the best r value is recalled and used as the solution to Eq. (1).
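The wavelength-stepping procedure just described can be sketched as follows. This is an interpretation for illustration, not the authors' program. The function names are hypothetical, and several details are assumptions: no intercept term is carried in the matrix, "largest B" is taken as the largest signed value, a wavelength already present in the matrix is skipped when it comes up again, and singular matrices are not guarded against.

```python
import numpy as np

def solve_pivot_rows(R, c):
    """Gauss-Jordan reduction of [R | c] with row interchanging.

    R has one row per training sample and one column per wavelength in the
    matrix; the j pivot rows picked by the interchange determine the B values.
    """
    M = np.column_stack([R, c]).astype(float)
    n_rows, j = R.shape
    for col in range(j):
        pivot = col + int(np.argmax(np.abs(M[col:, col])))
        M[[col, pivot]] = M[[pivot, col]]
        M[col] /= M[col, col]
        for row in range(n_rows):
            if row != col:
                M[row] -= M[row, col] * M[col]
    return M[:j, -1].copy()

def row_reduction_select(spectra, conc, j, passes=2):
    """Step every wavelength through a j-column matrix; keep the best r."""
    n_wl = spectra.shape[1]
    cols = list(range(j))  # start from the first j wavelengths
    # Pass 1 steps in the rest of the spectrum; later passes step in all of it.
    queue = list(range(j, n_wl)) + list(range(n_wl)) * (passes - 1)
    best_r, best_cols, best_B = -np.inf, None, None
    for nxt in queue + [None]:
        B = solve_pivot_rows(spectra[:, cols], conc)
        r = np.corrcoef(spectra[:, cols] @ B, conc)[0, 1]
        if r > best_r:
            best_r, best_cols, best_B = r, list(cols), B
        if nxt is None:
            break
        if nxt not in cols:
            # Drop the column with the largest B and insert the next wavelength.
            cols[int(np.argmax(B))] = nxt
    return best_r, best_cols, best_B
```

Calling `row_reduction_select(spectra, conc, j=4)` on a samples-by-wavelengths array returns the best correlation coefficient found, the wavelength indices that produced it, and the corresponding B values.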

In the wavelength-stepping procedure, the dropping of the column with
the largest B value has an interesting effect. If reflectances at all
wavelengths have roughly equivalent magnitudes, a reasonable assumption in the
near infrared, the wavelength with the largest B value will contain the most
information about the sought-for species. Because it is this "most important"
wavelength that is dropped, the selection operation rapidly collects the most
orthogonal wavelengths (those least correlated with the desired constituent and
most correlated with background). This same selection criterion prevents the
matrix from becoming "ill determined" and therefore subject to large roundoff
error. An "ill determined" matrix typically contains very large positive
and negative B values in pairs. Because the largest positive B value will
be dropped by the selection criterion of row reduction, the ill-determined
pairs are broken up and the matrix becomes well behaved and less subject to
roundoff error.

When  the  procedure  steps  through  the  wavelengths  a  second  time,  the 
same  selection  criterion  naturally  seeks  out  the  wavelength  best  correlated 
with  the  desired  constituent.  As  each  new  wavelength  is  added,  the  solution 
to  the  linear  equation  uses  all  of  the  collected  background  wavelengths  to 
calculate  a  background-corrected  calibration.  Because  the  best  correlation 
with  the  concentration  of  the  desired  constituent  is  stored,  the  wavelength 
which  is  retained  is  the  one  that  shows  the  greatest  ability  to  be  background- 
corrected. 


One might initially surmise that it would be better to drop the smallest
rather than the largest B value during the wavelength-stepping procedure.
However, because bands in the near-infrared portion of the spectrum are
strongly overlapped, precise background correction is critical for a
successful calibration. Dropping the smallest B value during the row-reduction
process would keep only those wavelengths which are most highly correlated
with the desired constituent and would fail to provide adequate background
correction.

II.  EXPERIMENTAL 


A set of simulated spectra was used initially to test the row-reduction
algorithm. Four series of random numbers were used to simulate the absorbance
spectrum of four pseudo-species at 15 pseudo-wavelengths in each spectrum.
Ten pseudo-samples were generated by combining randomly selected amounts of
each of the four pseudo-species. The spectrum of each sample was then
calculated from a strict application of Beer's law, assuming additivity of the
absorbances of the sample constituents. After the simulated spectra were
computed, various levels of random noise were added to the spectral and
concentration values.
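A data set of this kind can be generated along the following lines. This is a sketch under stated assumptions: the source does not give the random-number distributions or noise levels, so uniform random values, Gaussian noise, and the function name `simulate_spectra` are choices made only for the example.

```python
import numpy as np

def simulate_spectra(n_species=4, n_wavelengths=15, n_samples=10,
                     noise_sd=0.0, seed=0):
    """Simulated absorbance spectra built from additive Beer's-law mixing."""
    rng = np.random.default_rng(seed)
    # One random "spectrum" per pseudo-species (assumed uniform on [0, 1)).
    pure = rng.random((n_species, n_wavelengths))
    # Randomly selected amount of each pseudo-species in each pseudo-sample.
    amounts = rng.random((n_samples, n_species))
    # Beer's law with additive absorbances: sample spectrum = amounts x pure.
    spectra = amounts @ pure
    # Optional noise on both spectral and concentration values (assumed Gaussian).
    noisy_spectra = spectra + rng.normal(0.0, noise_sd, spectra.shape)
    noisy_amounts = amounts + rng.normal(0.0, noise_sd, amounts.shape)
    return noisy_spectra, noisy_amounts, pure
```

With `noise_sd=0.0` the spectra have rank four, so four suitably chosen wavelengths are enough to recover any one pseudo-species exactly.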

In the first real test of the new algorithm, a set of absorbance data
for methyl-red and methyl-orange mixtures, obtained from reference 9, was
used to predict solution pH. The data consisted of absorbances obtained
at discrete wavelengths ranging from 375 to 575 nm.

In order to compare the new algorithm with those employed earlier, a
set of 100 near-infrared diffuse-reflectance spectra of ground wheat samples
was obtained from the USDA, Beltsville, MD, and used to predict the percent
protein in wheat. Each sample had been assayed for protein by 32 replicate
Kjeldahl determinations. The exact description of the data set has been
published elsewhere.10 The data were used as received with the exception
that only every fourth wavelength was considered, for a total of 125
wavelengths. These 125 wavelengths ranged from 1 to 2.6 µm in increments of
12.8 nm. The reported instrumental bandpass was 7 nm and no spectral
averaging was used. Fifty samples were used to train the new algorithm
and the remaining 50 were used to test it.

Finally, a set of 94 absorbance spectra of synthetic mixtures of benzene,
cyclohexane, n-heptane, and iso-octane was used to predict the concentrations
of benzene. These spectra were obtained from a Digilab FTS 15C
Fourier-Transform infrared spectrometer at a resolution of 8 cm^-1. A spectral
range of 1.67 to 2.5 µm was considered. Of the 94 measured spectra, 47 samples
were used to train the algorithm and 45 samples were used to test it. Two
sample spectra were discarded because of verified instrumental error during
their acquisition.

III.  RESULTS 


A. Experiments with simulated spectra simplified
the evaluation of the row-reduction algorithm under varying
conditions. Several general trends were apparent from these experiments: 1)
when no noise was added to the simulated spectra, the algorithm generated
an exact solution to Eq. (1) with a SEE of 0; 2) when noise was selectively
added, the algorithm consistently chose those wavelengths with the least
noise; 3) when additional simulated wavelengths were added which
contained no information (i.e. were not related to sample composition), they
were never chosen when the signal-to-noise ratio of the overall spectrum
was greater than 12; 4) when the signal-to-noise ratio of a spectrum was less
than 12, the algorithm was less able to distinguish between wavelengths
containing information and those containing no information. The probability
that an invalid wavelength would be chosen increased as the signal-to-noise
ratio decreased. These trends show that the new row-reduction algorithm
is viable as long as the signal-to-noise ratio of a spectrum is large enough
to make any data reduction worthwhile.


B. The correlation with pH in
mixtures of methyl-orange and methyl-red solution spectra gave statistical
correlations ranging from r = 0.9798 to r = 0.9999, as shown in Table I.
These results clearly indicate that the row-reduction algorithm performs
well for real solutions where Beer's law is obeyed.


C. Determination of Protein in Wheat.
The correlation for protein in wheat is shown in Table II. These results
compare well with those obtained by the technique of curve fitting.11 It
should be noted that the number of samples and the number of wavelengths
examined at a time in the row-reduction algorithm are necessarily equal
because of the fundamental relationship of m independent equations for m
independent unknowns in linear algebra. The correlation obtained by the
row-reduction method using 7 wavelengths is shown graphically in Fig. 2; the
wavelengths and their respective B coefficients (cf. Eq. 1) are listed in
Table III.


D. The correlation
for benzene in hydrocarbons is shown graphically in Fig. 3; the wavelengths
used and their respective B coefficients are listed in Table IV. The two
samples plotted as circles in Fig. 3 were known to be in error because of
an inadequate instrumental N2 purge. These latter samples have not been
used in calculating the least-squares line, but were retained on the plot
to illustrate the possible effect and magnitude of instrumental errors.
Although 47 points are plotted in Fig. 3, the precision of the prediction
is such that many of the points are not spatially resolvable.

IV.  DISCUSSION 

The prediction of protein in wheat shown in Table II and Fig. 2 verifies that
the row-reduction algorithm is competitive with other regression techniques
as far as standard error of prediction is concerned. There are other
considerations, however, which favor row-reduction over multilinear regression.
One of these considerations is computation time.

The number of multiplications and divisions required to solve Eq. 1
for a single matrix is equal to:

$$\left[ -N_w^3 + 3(N_s + 1)N_w^2 + (3N_s + 4)N_w - 6N_s \right] / 6 \tag{11}$$

where $N_w$ is the number of wavelengths in Eq. 1 and $N_s$ is the number of
samples in the training set. In turn, the total number of matrices which must
be solved to obtain a calibration via row reduction is approximately the total
number of wavelengths to be considered ($N_\lambda$) times the number of passes
through the wavelength set. Because the number of passes is usually 2, the
number of matrices to be solved is ordinarily $2N_\lambda$. Multiplying the
number of multiplications and divisions per matrix (Eq. 11) by the number of
matrices ($2N_\lambda$) gives the total number of computations ($N_R$):

$$N_R = N_\lambda \left[ -N_w^3 + 3(N_s + 1)N_w^2 + (3N_s + 4)N_w - 6N_s \right] / 3 \tag{12}$$

The number of multiplications and divisions necessary to obtain a NIRA
calibration by "all possible pairs" or "all possible triplets" stepwise
regression can be deduced by a two-part computation. The first part is
the calculation of the cross terms:

$$\text{\# Cross Terms} = \frac{N_\lambda (N_\lambda + 1)}{2} \tag{13}$$

Each of these terms is composed of $N_s$ multiplications, so the total number
of computations for determining the cross terms is:

$$\text{\# Cross Term Multiplications} = \frac{N_\lambda (N_\lambda + 1) N_s}{2} \tag{14}$$

The second part of the regression calculation is the inversion of
matrices. Each i-by-i matrix inversion requires $i^3$ multiplications and
divisions. The number of matrices to be inverted by the all-possible-pairs
stepwise regression is

$$\frac{N_\lambda (N_\lambda - 1)}{2} + 2(N_w - 2)N_\lambda \tag{15}$$

The corresponding number for the all-possible-triples stepwise regression is

$$\frac{N_\lambda (N_\lambda - 1)(N_\lambda - 2)}{6} + 2(N_w - 3)N_\lambda \tag{16}$$

where both Eqs. 15 and 16 assume one checkback per wavelength addition.

From Eqs. 15 and 16 and the number of multiplications and divisions
required to invert each matrix, Eqs. 17 and 18 can be obtained.

$$\text{No. of calculations in all-possible-pairs stepwise regression} = 4 N_\lambda (N_\lambda - 1) + 2 \sum_{i=3}^{N_w} i^3 N_\lambda \tag{17}$$

$$\text{No. of calculations in all-possible-triples stepwise regression} = (9/2) N_\lambda (N_\lambda - 1)(N_\lambda - 2) + 2 \sum_{i=4}^{N_w} i^3 N_\lambda \tag{18}$$

An examination of Eq. 12 and Eqs. 14 plus 17 or Eqs. 14 plus 18
gives a semiquantitative basis of comparison of the row-reduction and
regression methods. This comparison is tabulated in Table V. It can be
observed that row-reduction becomes much more efficient as $N_\lambda \gg N_s$.
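The operation counts can be evaluated directly. The sketch below codes Eq. 12, Eq. 14 and the two stepwise inversion counts; the function names are illustrative. It assumes that the i-by-i inversions in the stepwise methods are summed from 3 (pairs) or 4 (triples) up to the final number of wavelengths, at $i^3$ operations each, which is a reading of Eqs. 15 and 16 rather than something stated explicitly in the text.

```python
def row_reduction_ops(n_lambda, n_s, n_w, passes=2):
    """Eq. 12: per-matrix cost (Eq. 11) times passes * n_lambda matrices."""
    per_matrix = (-n_w**3 + 3 * (n_s + 1) * n_w**2
                  + (3 * n_s + 4) * n_w - 6 * n_s) // 6
    return passes * n_lambda * per_matrix

def cross_term_ops(n_lambda, n_s):
    """Eq. 14: multiplications needed to form all cross terms."""
    return n_lambda * (n_lambda + 1) * n_s // 2

def pairs_stepwise_ops(n_lambda, n_s, n_w):
    """Cross terms plus inversions for all-possible-pairs stepwise regression."""
    start = 2**3 * n_lambda * (n_lambda - 1) // 2       # every 2x2 matrix
    steps = 2 * n_lambda * sum(i**3 for i in range(3, n_w + 1))
    return cross_term_ops(n_lambda, n_s) + start + steps

def triples_stepwise_ops(n_lambda, n_s, n_w):
    """Cross terms plus inversions for all-possible-triples stepwise regression."""
    start = 3**3 * n_lambda * (n_lambda - 1) * (n_lambda - 2) // 6  # every 3x3
    steps = 2 * n_lambda * sum(i**3 for i in range(4, n_w + 1))
    return cross_term_ops(n_lambda, n_s) + start + steps
```

For 19 wavelengths and 10 samples, adding the row-reduction counts for the 5- and 6-wavelength solutions gives 12,198 operations, while the stepwise functions carried to 6 wavelengths give 19,684 (pairs) and 43,453 (triples), in line with the 12 K, 20 K and 43 K entries in the first row of Table V.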

B. Other Advantages of Row Reduction. Row reduction has several
advantages over regression other than computational efficiency. These
advantages include an increased immunity to baseline drift and to
overfitting.

If spectral baseline drift occurs, all wavelengths shift up or down
together. Therefore, the offset caused by these shifts can be avoided if
the B coefficients of Eq. (1) add to zero. The set of typical B coefficients
shown in Table III, calculated by row reduction, add very nearly to zero.
This feature is inherent in the row-reduction algorithm and avoids the
problem of forcing the sum of the regression coefficients to zero.
Regression techniques do not inherently possess this feature.
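The effect of a uniform baseline shift follows directly from Eq. (1): adding the same offset to every R value changes the predicted concentration by the offset times the sum of the B coefficients. A short check using the Table III multipliers is given below; the offset of 0.01 is an arbitrary value chosen for illustration.

```python
# B multipliers for the seven-wavelength wheat calibration (Table III).
table_iii_b = [900.3, -967.6, 5.1, 34.2, 1.3, 42.9, -18.0]

def baseline_shift_effect(b_values, offset):
    """Change in predicted concentration when every R value shifts by offset."""
    return offset * sum(b_values)

b_sum = sum(table_iii_b)                                   # about -1.8
largest = max(abs(b) for b in table_iii_b)                 # 967.6
shift_error = baseline_shift_effect(table_iii_b, 0.01)     # about -0.018
```

The coefficients sum to about -1.8, which is roughly 0.2 percent of the largest individual multiplier, so a baseline offset of 0.01 in -log (reflectance) moves the predicted protein value by only about 0.018 percent.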


Overfitting occurs when the solution to Eq. (1) reflects trends in the
training sample set that are not present in the prediction set. Row-reduction
helps reduce the likelihood of overfitting through its use of only the most
orthogonal samples to determine the B coefficients. This selection prevents
averaging or diluting the uniqueness of individual samples and forces the
prediction to be valid for the most unusual samples in the training set, not
for the most "typical" samples.

Finally,  row-reduction  allows  an  a  priori  test  for  overfitting  even  if 
the  samples  whose  constituents  are  to  be  predicted  have  not  been  chemically 
determined  (i.e.,  are  not  part  of  the  training  or  prediction  sets).  In 
particular,  if  the  spectrum  of  a  new  sample  (at  the  wavelengths  used  in 
Eq.  (1))  cannot  be  formed  by  some  combination  of  the  spectra  of  the  samples 
used  to  solve  Eq.  (1),  that  sample  cannot  be  accurately  predicted.  This 
method  for  detecting  the  presence  of  overfitting  will  be  discussed  in  a 
subsequent  paper. 

V.  CONCLUSION 


The new row-reduction algorithm appears to be a valid technique for
finding the correlation between chemical composition and the absorbance or
reflectance spectra for spectrally and chemically complex samples. Row
reduction has the advantages of computational ease and increased resistance
to spectral errors compared to regression methods. Finally, row reduction
is conceptually simpler than a multilinear regression, a feature which
should aid future research in and interpretation of the NIRA technique.

TABLE I. Correlation with pH in Mixtures of Methyl-Red and Methyl-Orange
Solutions.

                                        Number of        Correlation
Chemical System   Number of Samples     Wavelengths      Coefficient
                                        Retained         (r value)

[illegible]              4                  2              .9798
[illegible]              4                  2              .9999
[illegible]              5                  2              .9934


TABLE II. Prediction of Percent Protein in Wheat Using the Row-Reduction Algorithm

Number of Wavelengths     Number of Samples     Reliability of Predicted Percent Protein
Retained for Prediction*  Used, Both Methods    Row-Reduction         Reference 11
by Row-Reduction                                SEE       SEP         SEE       SEP
Algorithm

          2                       2             1.36      1.15        2.28      2.16
          3                       3             0.40      0.46        2.30      2.30
          4                       4             0.38      0.38        0.243     0.30
          5                       5             0.36      0.36        0.24      0.30
          6                       6             0.31      0.35        0.243     0.30
          7                       7             0.27      0.26        0.14      0.15

TABLE III. Seven-wavelength Correlation for Protein in Wheat
(See also Fig. 2)

WAVELENGTH      MULTIPLIER
   (µm)         (B value)

   1.73           900.3
   1.74          -967.6
   1.86             5.1
   1.97            34.2
   2.15             1.3
   2.17            42.9
   2.52           -18.0

TABLE IV. Eight-wavelength Correlation for Benzene in Hydrocarbons
(See also Fig. 3)

WAVELENGTH      MULTIPLIER
   (µm)         (B value)

   2.011          0.00073
   2.023          0.00971
   2.164          0.11889
   2.168         -0.26828
   2.171          0.29825
   2.175         -0.24175
   2.179          0.12348
   2.189         -0.03678


TABLE V. Number of Computations for Finding the Best 5 and 6-Wavelength Correlations
by Row-Reduction and Regression Methods.*

                                     Regression methods
Number of Wavelengths   Number of    All possible      All possible       Row
to Search               Samples      Pairs stepwise    Triples stepwise   Reduction

        19                 10           20 K              43 K             12 K
        19                 25           21 K              44 K             32 K
        19                 50           27 K              51 K             64 K
       140                 10          297 K              12 M             90 K
       140                 25          446 K              12 M            232 K
       140                 50          692 K              13 M            470 K
       700                 10          5.0 M             515 M            449 K
       700                 25          8.7 M             519 M            1.1 M
       700                 50           14 M             525 M            2.4 M

* K = x 10^3; M = x 10^6


FIGURE  CAPTIONS 


Figure  1.  Data  configuration  of  the  row-reduction  matrix. 

Figure  2.  Predicted  vs.  actual  percent  protein  in  wheat  using  new  row- 
reduction  algorithm.  Fifty  samples  were  predicted  using 
seven  wavelengths  with  a  SEP  of  0.26%  protein. 

Figure 3. Predicted vs. actual percent benzene in hydrocarbons. Crosses
represent valid data points. Circles represent data points
with instrumental errors. Forty-five samples were predicted
using eight wavelengths with a SEP of 1.01% benzene.